feat: add persistent custom operator registry #968

Open
cmgzn wants to merge 8 commits into datajuicer:main from cmgzn:feat/persistent-custom-operators

Collaborator

cmgzn commented Apr 15, 2026

Summary

Add a persistent JSON-based registry (~/.data_juicer/custom_op.json) so that user-defined custom operators survive across processes without requiring re-registration on every run.

Motivation

Previously, custom operators had to be re-loaded via config every time a process started. This made it cumbersome to work with reusable custom ops across sessions, scripts, and CLI invocations.

Changes

  • data_juicer/utils/custom_op.py (new) — Core module for persistent custom op management:

    • JSON registry at ~/.data_juicer/custom_op.json storing source paths keyed by op name
    • load_persistent_custom_ops() replays registrations on startup, auto-cleaning stale entries
    • CLI interface: python -m data_juicer.utils.custom_op {list,register,unregister,reset}
    • Dynamic module/package loading extracted from config.py
  • data_juicer/utils/registry.py — Add unregister_module() to Registry class

  • data_juicer/ops/__init__.py — Call load_persistent_custom_ops() at import time after built-in ops are loaded

  • data_juicer/config/config.py — Replace inline loading logic with a re-export from custom_op for backward compatibility

  • data_juicer/tools/op_search.py — Harden OPRecord to handle custom ops with non-standard module paths, missing source files, and absent test files

  • data_juicer/tools/DJ_mcp_granular_ops.py — Adapt MCP tooling for enhanced OPRecord fields

  • docs/DeveloperGuide.md, docs/DeveloperGuide_ZH.md — Document the new persistent registration workflow
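The replay-and-clean behavior described for load_persistent_custom_ops() can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (load_registry, save_registry, replay_registry), the loader callback, and the flat name-to-path JSON layout are all assumptions made for the example, and the demo writes to a temporary directory rather than ~/.data_juicer.

```python
import json
import tempfile
from pathlib import Path


def load_registry(registry_file: Path) -> dict:
    """Return the op-name -> source-path mapping, or {} if the file is absent."""
    if not registry_file.exists():
        return {}
    return json.loads(registry_file.read_text(encoding="utf-8"))


def save_registry(registry_file: Path, entries: dict) -> None:
    """Write the mapping back to disk, creating the parent directory if needed."""
    registry_file.parent.mkdir(parents=True, exist_ok=True)
    registry_file.write_text(json.dumps(entries, indent=2), encoding="utf-8")


def replay_registry(registry_file: Path, loader) -> list:
    """Call loader(name, path) for every live entry and drop stale ones.

    An entry is treated as stale when its source path no longer exists
    (an assumption for this sketch).
    """
    entries = load_registry(registry_file)
    live = {name: path for name, path in entries.items() if Path(path).exists()}
    if live != entries:
        save_registry(registry_file, live)  # persist the cleaned registry
    for name, path in live.items():
        loader(name, path)
    return sorted(live)


# Demo in a temporary directory with one live and one stale entry.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    op_file = tmp_dir / "my_op.py"
    op_file.write_text("# hypothetical custom op\n", encoding="utf-8")
    registry_file = tmp_dir / "custom_op.json"
    save_registry(
        registry_file,
        {"my_op": str(op_file), "gone_op": str(tmp_dir / "missing.py")},
    )
    loaded = []
    names = replay_registry(registry_file, lambda name, path: loaded.append(name))
    remaining = load_registry(registry_file)
```

In the demo, the entry whose source file is missing is removed from the JSON file before any loader call, so a later process would not retry it.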

Testing

  • tests/utils/test_custom_op.py
  • tests/tools/test_op_search.py

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces a persistent custom operator registry, enabling externally developed operators to be registered and automatically loaded across sessions. Key changes include the addition of data_juicer.utils.custom_op for registry management, an enhanced CLI for the operator search tool, and improved parameter handling in MCP tool generation. Review feedback identifies safety concerns regarding the manual cleanup of sys.modules during operator unregistration and highlights inconsistencies between the documentation and implementation regarding registry filenames and environment variables. Additionally, more robust error handling was recommended for resolving operator source paths.

Comment thread data_juicer/utils/custom_op.py Outdated
Comment thread data_juicer/utils/custom_op.py Outdated
Comment thread docs/DeveloperGuide.md
Comment thread docs/DeveloperGuide_ZH.md
Comment thread data_juicer/tools/op_search.py Outdated
except Exception as e:
    # Clean up partially-initialized module to avoid stale entries
    sys.modules.pop(module_name, None)
    raise RuntimeError(f"Error loading '{abs_path}' as '{module_name}': {e}")
Collaborator

Check whether we need to roll back OPERATORS here.
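One way the rollback could look, as a rough sketch only: snapshot the registry before executing the module and restore it if loading fails, so that operators registered by a half-executed file do not linger. Everything here is hypothetical scaffolding for illustration. OPS is a plain dict standing in for OPERATORS, and register, demo_registry, and load_with_rollback are made-up names, not Data-Juicer APIs.

```python
import importlib.util
import sys
import tempfile
import types
from pathlib import Path

# Hypothetical stand-in for the OPERATORS registry: op name -> class.
OPS = {}


def register(name):
    """Decorator that records a class in OPS under the given name."""
    def deco(cls):
        OPS[name] = cls
        return cls
    return deco


# Expose `register` through an importable module so the demo file can use it.
demo_registry = types.ModuleType("demo_registry")
demo_registry.register = register
sys.modules["demo_registry"] = demo_registry


def load_with_rollback(path, module_name):
    """Execute a source file as a module; undo registrations if it fails."""
    snapshot = dict(OPS)  # shallow copy of the registry before loading
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    try:
        spec.loader.exec_module(module)
    except Exception as e:
        # Drop the partially-initialized module and restore the registry.
        sys.modules.pop(module_name, None)
        OPS.clear()
        OPS.update(snapshot)
        raise RuntimeError(f"Error loading '{path}' as '{module_name}': {e}") from e
    return module


# A file that registers an op and then fails partway through import.
BROKEN_SOURCE = '''
from demo_registry import register

@register("half_loaded_op")
class HalfLoadedOp:
    pass

raise ValueError("boom")
'''

with tempfile.TemporaryDirectory() as tmp:
    broken_path = Path(tmp) / "broken_op.py"
    broken_path.write_text(BROKEN_SOURCE, encoding="utf-8")
    try:
        load_with_rollback(str(broken_path), "broken_op")
        failed = False
    except RuntimeError:
        failed = True
```

Without the snapshot and restore, "half_loaded_op" would stay in the registry even though its module was removed from sys.modules.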

Collaborator

ShenQianli left a comment

The persistence model is at the operator level, but the real loading unit is the module/package path. That makes it unreliable for package-based custom ops, relative imports, and multi-operator modules. I think persistence should be path-based instead.
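To make the path-based alternative concrete, here is a minimal sketch of a registry keyed by source path, with the operator names stored as derived data under each path. The JSON layout and the helper names (register_path, unregister_path) are assumptions for illustration and are not part of this PR.

```python
import json
import tempfile
from pathlib import Path


def _read(registry_file: Path) -> dict:
    """Load the path -> op-names mapping, or {} if the file does not exist."""
    if not registry_file.exists():
        return {}
    return json.loads(registry_file.read_text(encoding="utf-8"))


def register_path(registry_file: Path, source_path, op_names) -> dict:
    """Record one loadable unit (file or package) and the ops it provides."""
    entries = _read(registry_file)
    key = str(Path(source_path).resolve())  # one entry per module/package path
    entries[key] = sorted(op_names)
    registry_file.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries


def unregister_path(registry_file: Path, source_path) -> list:
    """Remove a loadable unit and return every op name that went with it."""
    entries = _read(registry_file)
    removed = entries.pop(str(Path(source_path).resolve()), [])
    registry_file.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return removed


# Demo: a hypothetical package that defines two operators.
with tempfile.TemporaryDirectory() as tmp:
    registry_file = Path(tmp) / "custom_op.json"
    package_dir = Path(tmp) / "my_ops_pkg"
    package_dir.mkdir()
    entries = register_path(registry_file, package_dir, ["op_b", "op_a"])
    entry_count = len(entries)
    ops_for_package = list(entries.values())[0]
    removed_ops = unregister_path(registry_file, package_dir)
    after_removal = _read(registry_file)
```

With this layout a multi-operator package is stored, replayed, and removed as a single unit, which is the property the comment above asks for.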

cmgzn force-pushed the feat/persistent-custom-operators branch from 21f6185 to 1e075cd on April 16, 2026 09:36